Introduction

The dataset that I am using in this project was found on Kaggle, the well-known Machine Learning Competition website. Click here for a full description of the dataset, or read the description file.

I worked in the automotive industry for 12 years and I remain a devoted pistonhead, so getting a better understanding of the used car market was very appealing.

This project focuses on the exploratory data analysis phase of the dataset. In particular, I will try to detect associations between variables, especially against price. The end-goal of such a project would be to build a price-prediction model for vehicles sold by eBay users.


0. Data Preparation

The dataset is well structured but there are some free text fields and many missing values. Moreover, some of the data is in German and needs to be translated. Most of the translations are straightforward, and Google Translate comes to the rescue where required! The “name” column is problematic: It is free text which causes all sorts of issues, and although a German NLP engineer could perhaps find interesting information in it, I chose to simply drop it.

##            dateCrawled seller offerType price abtest vehicleType
## 1: 2016-03-24 11:52:17 privat   Angebot   480   test          NA
## 2: 2016-03-24 10:58:45 privat   Angebot 18300   test       coupe
## 3: 2016-03-14 12:52:21 privat   Angebot  9800   test         suv
## 4: 2016-03-17 16:54:04 privat   Angebot  1500   test  kleinwagen
## 5: 2016-03-31 17:25:20 privat   Angebot  3600   test  kleinwagen
## 6: 2016-04-04 17:36:23 privat   Angebot   650   test   limousine
##    yearOfRegistration   gearbox powerPS model kilometer
## 1:               1993   manuell       0  golf    150000
## 2:               2011   manuell     190    NA    125000
## 3:               2004 automatik     163 grand    125000
## 4:               2001   manuell      75  golf    150000
## 5:               2008   manuell      69 fabia     90000
## 6:               1995   manuell     102   3er    150000
##    monthOfRegistration fuelType      brand notRepairedDamage dateCreated
## 1:                   0   benzin volkswagen                NA  2016-03-24
## 2:                   5   diesel       audi                ja  2016-03-24
## 3:                   8   diesel       jeep                NA  2016-03-14
## 4:                   6   benzin volkswagen              nein  2016-03-17
## 5:                   7   diesel      skoda              nein  2016-03-31
## 6:                  10   benzin        bmw                ja  2016-04-04
##    nrOfPictures postalCode            lastSeen
## 1:            0      70435 2016-04-07 03:16:57
## 2:            0      66954 2016-04-07 01:46:50
## 3:            0      90480 2016-04-05 12:47:46
## 4:            0      91074 2016-03-17 17:40:17
## 5:            0      60437 2016-04-06 10:17:21
## 6:            0      33775 2016-04-06 19:17:07
## Classes 'data.table' and 'data.frame':   371824 obs. of  19 variables:
##  $ dateCrawled        : POSIXct, format: "2016-03-24 11:52:17" "2016-03-24 10:58:45" ...
##  $ seller             : Factor w/ 2 levels "gewerblich","privat": 2 2 2 2 2 2 2 2 2 2 ...
##  $ offerType          : Factor w/ 2 levels "Angebot","Gesuch": 1 1 1 1 1 1 1 1 1 1 ...
##  $ price              : int  480 18300 9800 1500 3600 650 2200 0 14500 999 ...
##  $ abtest             : Factor w/ 2 levels "control","test": 2 2 2 2 2 2 2 2 1 2 ...
##  $ vehicleType        : Factor w/ 8 levels "andere","bus",..: NA 4 8 5 5 7 3 7 2 5 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 2 levels "automatik","manuell": 2 2 1 2 2 2 2 2 2 2 ...
##  $ powerPS            : int  0 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  0 5 8 6 7 10 8 7 8 0 ...
##  $ fuelType           : Factor w/ 7 levels "andere","benzin",..: 2 4 4 2 4 2 2 2 2 NA ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 2 levels "ja","nein": NA 1 NA 2 2 1 2 2 NA NA ...
##  $ dateCreated        : POSIXct, format: "2016-03-24" "2016-03-24" ...
##  $ nrOfPictures       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ postalCode         : int  70435 66954 90480 91074 60437 33775 67112 19348 94505 27472 ...
##  $ lastSeen           : POSIXct, format: "2016-04-07 03:16:57" "2016-04-07 01:46:50" ...
##  - attr(*, ".internal.selfref")=<externalptr>

The column abtest seems to be internal to E-Bay, probably the control or test groups for some internal A/B testing. I don’t believe we will need it. The column nrOfPictures only contains zeros, probably a data collection issue. We don’t need it either. I will also drop the postalCode column, because I don’t intend to cross-reference the data with a postal map of Germany although that could be another interesting project.

The values in the different factors are fairly straightforward. I translate them into English; at the same time I drop the 12 ads from people looking to purchase a car (offerType == Gesuch), as I don’t have confidence that they would have accurate information about car specifications, nor sensible asking prices. As a result, we no longer need this column.

The seller column contains only 3 professional traders, which is insignificant in comparison to the total number of observations. Therefore I drop the column.

Finally, I also noticed that some zeros should really be NAs: In price, monthOfRegistration, powerPS.

The dateCrawled and dateCreated columns might not be very useful in themselves, but they allow us to calculate how long an ad has been up for on the website, and thus gives us an approximate lower bound for selling time (see below for a discussion on this). By default this value is calculated in minutes, I convert it to days. With this new ad_up_time variable, we no longer need the other date variables.

We now have a clean and usable dataset:

## Classes 'data.table' and 'data.frame':   371809 obs. of  12 variables:
##  $ price              : int  480 18300 9800 1500 3600 650 2200 NA 14500 999 ...
##  $ vehicleType        : Factor w/ 8 levels "other","people carrier",..: NA 4 8 5 5 7 3 7 2 5 ...
##  $ yearOfRegistration : int  1993 2011 2004 2001 2008 1995 2004 1980 2014 1998 ...
##  $ gearbox            : Factor w/ 2 levels "automatic","manual": 2 2 1 2 2 2 2 2 2 2 ...
##  $ powerPS            : int  NA 190 163 75 69 102 109 50 125 101 ...
##  $ model              : Factor w/ 251 levels "100","145","147",..: 119 NA 120 119 104 12 9 41 62 119 ...
##  $ kilometer          : int  150000 125000 125000 150000 90000 150000 150000 40000 30000 150000 ...
##  $ monthOfRegistration: int  NA 5 8 6 7 10 8 7 8 NA ...
##  $ fuelType           : Factor w/ 7 levels "other","petrol",..: 2 4 4 2 4 2 2 2 2 NA ...
##  $ brand              : Factor w/ 40 levels "alfa_romeo","audi",..: 39 2 15 39 32 3 26 39 11 39 ...
##  $ notRepairedDamage  : Factor w/ 2 levels "yes","no": NA 1 NA 2 2 1 2 2 NA NA ...
##  $ ad_up_time         : num  818.53 817 1319.81 1.93 342.17 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "index")= int

1. Description of Individual Variables

Here, I look at each of the variables independently to understand their distribution.

##      price                   vehicleType    yearOfRegistration
##  Min.   :1.000e+00   sedan         :95962   Min.   :1000      
##  1st Qu.:1.250e+03   small car     :80095   1st Qu.:1999      
##  Median :3.000e+03   estate        :67625   Median :2003      
##  Mean   :1.780e+04   people carrier:30218   Mean   :2005      
##  3rd Qu.:7.490e+03   convertible   :22914   3rd Qu.:2008      
##  Max.   :2.147e+09   (Other)       :37103   Max.   :9999      
##  NA's   :10779       NA's          :37892                     
##       gearbox          powerPS            model          kilometer     
##  automatic: 77168   Min.   :    1.0   golf   : 30085   Min.   :  5000  
##  manual   :274424   1st Qu.:   80.0   other  : 26422   1st Qu.:125000  
##  NA's     : 20217   Median :  116.0   3er    : 20580   Median :150000  
##                     Mean   :  129.8   polo   : 13105   Mean   :125620  
##                     3rd Qu.:  150.0   corsa  : 12584   3rd Qu.:150000  
##                     Max.   :20000.0   (Other):248538   Max.   :150000  
##                     NA's   :40851     NA's   : 20495                   
##  monthOfRegistration    fuelType                brand       
##  Min.   : 1.00       petrol :224032   volkswagen   : 79693  
##  1st Qu.: 4.00       diesel :107824   bmw          : 40298  
##  Median : 6.00       lpg    :  5382   opel         : 40165  
##  Mean   : 6.38       cng    :   571   mercedes_benz: 35345  
##  3rd Qu.: 9.00       hybrid :   280   audi         : 32897  
##  Max.   :12.00       (Other):   312   ford         : 25591  
##  NA's   :37701       NA's   : 33408   (Other)      :117820  
##  notRepairedDamage   ad_up_time    
##  yes : 36308       Min.   :   0.0  
##  no  :263384       1st Qu.: 116.6  
##  NA's: 72117       Median : 349.7  
##                    Mean   : 515.3  
##                    3rd Qu.: 798.4  
##                    Max.   :1981.8  
## 

We have 12 variables which I can now plot individually.

1.1. Price

From the summary above, we see that prices go up to over \(€2.10^9\)! This is obviously wrong. While looking at all cars over €100,000 in more detail, I noticed that many of these prices seemed either entered at random or confused with kilometers: I found patterns such as ‘111111’ or ‘12345678’, or mainstream models over €150,000. To try and filter out most of these issues, I assumed that such high-end cars would most likely be coupes, convertibles or SUVs. I also dropped any observation above €200,000, assuming the majority of them would be input errors. I then dropped any row that did not match these criteria and looked at the brands of cars above €75,000:

##  [1] chevrolet     porsche       other         mercedes_benz bmw          
##  [6] volkswagen    audi          jaguar        seat          ford         
## [11] land_rover    opel          renault       smart         nissan       
## [16] mitsubishi   
## 40 Levels: alfa_romeo audi bmw chevrolet chrysler citroen dacia ... volvo

Some of these brands are not considered premium and it is surprising to find them here. Let’s see the models and prices for the non-premium brands (Volkswagen, Seat, Ford, Opel, Renault, Smart, Nissan, Mitsubishi):

##      price    vehicleType yearOfRegistration   gearbox powerPS       model
##  1:  99999             NA               1910        NA      NA          NA
##  2:  99999             NA               1970        NA      NA       other
##  3: 123456             NA               2000    manual      75        golf
##  4: 154651             NA               2005        NA      NA       altea
##  5: 130000          coupe               1968        NA      NA     mustang
##  6: 100000          coupe               1968    manual     131       other
##  7:  93000          coupe               1971    manual      NA       other
##  8: 100000      small car               2013 automatic      NA       other
##  9: 100000             NA               2000        NA      NA          NA
## 10:  78000            SUV               2015 automatic     340     touareg
## 11:  99999            SUV               2001    manual     114     x_trail
## 12:  99999             NA               1910        NA      NA          NA
## 13: 130000             NA               2000        NA      NA          NA
## 14: 120000          coupe               1967    manual     550     mustang
## 15:  99999             NA               2017    manual     204        golf
## 16:  85000 people carrier               1967    manual      44 transporter
## 17: 123456             NA               1985        NA      NA        golf
## 18:  79499            SUV               2015 automatic     340     touareg
## 19:  99999      small car               2009 automatic      71      fortwo
##     kilometer monthOfRegistration fuelType      brand notRepairedDamage
##  1:    150000                  NA       NA volkswagen                NA
##  2:    150000                  NA       NA volkswagen                NA
##  3:    150000                   7       NA volkswagen                NA
##  4:    150000                  NA       NA       seat                NA
##  5:     50000                   7   petrol       ford                NA
##  6:    100000                   5   petrol       opel                no
##  7:     30000                   2   petrol    renault                no
##  8:    150000                  11 electric      smart                no
##  9:    150000                  NA       NA volkswagen                NA
## 10:     50000                   4   diesel volkswagen                no
## 11:    150000                   9   diesel     nissan                no
## 12:    150000                  NA       NA volkswagen                NA
## 13:    150000                  NA       NA mitsubishi                NA
## 14:    100000                   9   petrol       ford                no
## 15:    150000                   2       NA volkswagen                NA
## 16:    150000                   1   petrol volkswagen                no
## 17:    125000                  NA       NA volkswagen                NA
## 18:     20000                   5   diesel volkswagen                no
## 19:     50000                  10   petrol      smart                no
##     ad_up_time
##  1: 1341.10625
##  2:   89.24861
##  3:  207.49583
##  4:   12.56250
##  5:  932.87153
##  6:    0.00000
##  7:  664.18125
##  8:    0.00000
##  9: 1791.43472
## 10: 1817.59167
## 11: 1441.67153
## 12:  125.89375
## 13: 1087.41875
## 14:  904.70556
## 15:  802.15972
## 16: 1592.41250
## 17:  854.05069
## 18: 1642.35347
## 19:  744.62014

So we find a variety of models there, some unnamed. The VW Touareg and Ford Mustang seem legitimate. There are also some really old cars, for which a high price might be justified to a collector, but in most cases the model name is not mentioned for these so it is difficult to say whether they are genuinely expensive cars or errors. Since we have more than enough observations overall, I decide to drop them and only retain the VW Touareg and Ford Mustang.

There are also cars below €100, which I assume are also errors or sellers not wanting to filter themselves out of the price selector on the website. I drop these too.

I then plot the variable again with log plus 1 transformation on the x-axis:
Red bar indicates mean, blue bar indicates median

Red bar indicates mean, blue bar indicates median

With the scale transformation, we have a roughly normal distribution. We notice that some bins have a much higher count than their neighbours, presumably corresponding to round values or “psychological” prices (e.g. €9,900).

1.2. Vehicle Type

This categorical variable has 8 levels and indicates the body style of the car (sedan, coupe, SUV etc.)

This tends to reflect the general West-European market, with a prominence of “family” vehicles and smaller volumes of “niche” products (although a sample of new car registrations would probably show a higher proportion of SUVs considering the rise of this body style in recent years). Also note the large number of NAs – about 20,000. E-Bay could definitely do a better job at encouraging their customers to write their ads properly.

1.3. Year of Registration

This will basically tell us about the age of the vehicle. From the data summary, we saw that the minimum year is 1,000, which will come as a surprise to most historians. The maximum year is 9,999, which is obviously wrong as we will all be teleporting by then. So we clearly need to tidy up this variable. To keep things simple, I select only vehicles registered since 1960. As the data was collected in 2016, we set that year as our upper bound. Then we plot a histogram.

Red bar indicates mean, blue bar indicates median

Red bar indicates mean, blue bar indicates median

Note that the mean and the median are identical. The distribution is close to normal, with the following exceptions:

  • A significant left tail
  • Three peaks: one in 1999-2000, one in 2005-2006 and a large one in 2016.

I did some research and it turns out that 1999, 2000, 2005, 2006 were all among the strongest years for new car registrations in Germany in the last 20 years. As they are also in the heart of the used car market in terms of age, it makes sense that they would translate into these peaks.
As for 2016, the explanation is less obvious, especially as the data was collected in March and April, which is quite ealy in the year. The peak could be due to some listing errors (on purpose or not) where owners enter a date at random or to attract visitors to their ad. It could also be linked to the website’s features when creating an ad (eg. default value in drop-down menu).
But there could also be a number of genuine 2016 cars suddenly arriving on the market. Employees in the automotive industry have often access to cheap car leasing schemes, whereby they can change their vehicle every 6 months or so. More importantly, most manufacturers register large numbers of demonstrators, press units and self-registered cars (new vehicles registered by the manufacturer or its dealers, in order to artificially boost market share and / or create cheaper opportunities to capture some customers over the competition).

1.4. Month of Registration

We should note from the summary in the beginning that monthOfRegistration contains nearly 38,000 NAs. Again, it is surprising that year seems mandatory (although its value is clearly not controlled) but month is not.

March and June are the strongest month for vehicle registrations. A quick research on the internet confirmed that this is consistent with the car registration seasonality that we observe in Germany (incidently, there is a similar effect in France and the UK).

1.5. Transmission Type

The gearbox variable can only take two values: manual or automatic.

The European market is primarily a manual transmission market, so no surprises here. Again, there are about 15,000 NAs.

1.6. Engine Power

The engine power is measured using the metric PS (1PS = 736 Watts). Again, we know from the summary that there are some nonsensical values in the data. I was prepared to remove anything below 40PS, but then I realised that there are Trabants in the dataset! This venerable left-over from the East-German Communist era is now widely used in Berlin as a rental car for people looking for a different experience of the city. Its 2-stroke engine managed 26PS!. Out of respect for such an antiquity, I decided to set the lower bound at 25PS.

As for the excessively high PS values, it looks like most of them are due to people confusing power output and engine capacity (in cm\(^3\)). I decided to set the limit at 600PS, a more than respectable value.

Red bar indicates mean, blue bar indicates median

Red bar indicates mean, blue bar indicates median

The distribution is postively skewed with a long tail (that may contain errors, as we just saw).

There are some prefered values – around 60, 100, 120, 140 for instance. These are values that have become some sort of “market standards”: Most manufacturers will offer engines around these values. It makes it easier for the consumer to compare products. Real, measurable power output of a car is never exactly equal to its rated horsepower due to variance in industrial processes. Differences of around 5% are not uncommon for a same model.

1.7. Model

This factor variable contails 251 levels, far too many to plot. But we can select the top 20:

Without surprise, the Volkswagen Golf (the most popular car in Europe) is also number one in the dataset. Note the very large number of vehicles designated as “other” – while some of them are probably models that exist but cannot be selected on the eBay website, it is unlikely that their number would be that high so we have to assume that once again, they are mostly due to human error.

The next car on the list is the BMW 3-Series, which is by no means a cheap car. This fact alone shows that we are indeed working on the German market – the most high-end market in Europe.

1.8. Mileage

… or more accurately, “kilometreage”.

Red bar indicates mean, blue bar indicates median

Red bar indicates mean, blue bar indicates median

This histogram clearly shows that mileage is not a free input field. This is surprising, because mileage is one of the most important pieces of information when it comes to used cars, so maximum accuracy would have been desirable. Moreover, when I went to E-Bay myself to try out the used car ad generator, I was able to enter any value. So maybe there is some aggregationg mechanism during data extract or the functionality changed recently?

The second thing that is striking with this chart is the predominance of 150,000km cars. Given the nature of the data, there is no point trying to apply scale transformations to improve the plot. This variable should actually be considered as a categorical variable more than a continuous one. I therefore add a variable in the dataset called km_cat. We will see later which one is more appropriate.

1.9. Fuel Type

This factor variable also contains many NAs (over 33,000).

“Alternative”" sources of energy are almost negligible in this dataset, which is not surprising considering that over 50% of the vehicles were 12 to 13 years old when this data was extracted.

Petrol is roughly twice as prominent as Diesel.

1.10. Brand

This is another factor with many levels (40) so we will take the same approach as with model.

The top 5 brands are German. Number 6 is Ford, which in Europe is largely perceived as German as it has its European headquarters in Cologne and many of its European products are actually designed and built in Germany. The next two brands are French, then Fiat is Italian. Seat is Spanish but it is actually part of the VW Group and their cars share almost all their components with VW products.
In other words, German manufacturers are hugely dominant on their home turf.

One issue we will have when looking for associations between variables is that with brand containing 40 levels, plots will be really hard to read (in addition to causing long processing times). We can improve this by grouping brands into categories based on brand perception. We could simply use mean prices to make these distinctions, but we would then create a correlation to price where there isn’t necessarily one. Moreover, brand perception involves a lot more than just price – there is history, perceived quality, marketing etc.

So I chose another approach: Use my domain-knowledge to manually classify the 40 levels into clusters. I didn’t plan on an exact number of clusters beforehand, just something manageable. I then intuitively grouped brands together and came up with 8 clusters which I then named.
Althought this intuitive approach is subjective, I believe it actually adds information to the dataset, unlike the “group by price” method which removes some.

There are large disparities between brand categories in terms of count. The biggest surprise is the predominance of the premium brands, which are supposedly the most expensive. But the analysis of the top 10 brands above explains it: This category contains BMW, Audi and Mercedes which are all among the most common brands in Germany. Clearly, this plot would look very different in most other countries.

1.11. Unrepaired Damage

The variable notRepairedDamage can only take two values: “yes” or “no”. But it does have NAs – about 72,000, which is twice as many as the number of “yes”. It does not seem like this is a mandatory field (and I could not find it on EBay). As I understand it, it refers to potential unrepaired damage on the vehicle being sold.

1.12. Ad Up-time

This is a composite variable that we created by substracting the dateCreated from the dateCrawled date. It is measured in days. Of course the idea is to look for potential correlations to other variables, especially price. From that point of view, there are important limitations associated with ad_up_time:

  • We don’t know whether there is any incitation for sellers to remove ads once the vehicle is sold. This means some ads might be really old and unattended despite the car having been sold long ago, through eBay or some other channel.
  • We are not even sure that ads removed from the website cannot be present in the scaped data.
Red bar indicates mean, blue car indicates median

Red bar indicates mean, blue car indicates median

Here we have observations up to 2,000 days (nearly 5.5 years). The long tail contains a significant number of observations and there seems to be a lot of variance in the data, so it is difficult to just drop observations over an arbitrary number of days posted.
The other thing I notice is that there seems to be some prevailing values at roughly three months intervals. I am not sure why this is – it might be related to the pricing structure used by eBay…?


2. Multivariate Analyses

In this section, I will examine pairs of variables to look for associations between variables, and then zoom in on some multi-variable combinations that seem particularly interesting.

First, let’s visualize potential variable associations with a plot matrix (to reduce computing time, I use a sample of 10,000 observations):

Even so the plot matrix is very clutered so we will build individual plots for the most promising variable pairs. Eventually I would like to be able to predict prices based on the other variables, so let’s focus on price as one of our variables.

2.1. Continuous variables

2.1.1. Price vs. Power, by Fuel Type

The plot matrix above shows that these two variables have the highest linear correlation so let’s start here. I make a scatterplot of price vs PowerPS with a colour-coding for fuelType (and keeping only the two most popular: petrol and Diesel – upper-case D is intentional as Diesel is the name of the inventor, Rudolf Diesel):

There clearly is a correlation between power and price, which does not seem perfectly linear but rather looks a little like some kind of root or log function. It also seems that for a given power output, Diesel cars are more expensive (not a great surprise considering that modern Diesel powertrains are usually more technologically advanced), but it is hard to tell from this plot.

Taking the log of powerPS definitely helps make the relationship look more linear for petrol cars, but it does not have very convincing results on Diesel cars. This means that when building the linear model, we will need a different set of parameters for each fuel type.

I will separate fuel types, and since we are dealing with technical specifications of vehicles, I will add an extra dimension with transmission type (gearbox):

We see that petrol cars are more spread out in terms of price and power than Diesel cars. Using the log of power, the smoothers are close to linear in the first category but not in the second one. We also notice that automatic cars tend to be both more powerful and more expensive than manual, but this will need to be confirmed later.

##    log_price_cor
## 1:     0.5837277
##    fuelType   gearbox log_price_cor
## 1:   diesel    manual     0.4795269
## 2:   diesel automatic     0.4939476
## 3:   petrol    manual     0.4917658
## 4:   petrol automatic     0.5857408

So when grouped by transmission and fuel types, the correlations are all around the .5 mark, except the Petrol / Automatic combination which is around .6. We see that the associations are similar in shape although for manual Diesel cars, the data points form a much tighter group so it is harder to tell.

So far, we have found that there is a positive correlation of price with power, and perhaps associations to gearbox and fuel type.

2.1.2. Price vs. Year of Registration

Another continuous variable with a significant correlation to price (according to our plot matrix) is yearOfRegistration.

The distribution looks very strange.

  • The first feature that I notice is a clear positive correlation that looks roughly linear for cars registered from roughly 1992 or 1995 onwards.
  • Secondly, we see a vertical line of observations in 2016, which we had already noticed when looking at yearOfRegistration alone. On the scatter plot, we can see that this line includes cars at just about any price from a couple hundred Euros to maybe €20,000. Strangely, the bulk of cars just older than this seems to be more expensive, which suggests that they are most probably errors (intentional or not) or people just picking the first year available on the drop-down menu when listing their car. If these cars were really 2016 cars, it would be illogical for them to be advertised at cheaper prices than cars 5 years older. This strongly advocates in favour of filtering 2016 cars out of the data.
  • Thirdly, it seems that cars registered before the early 1990’s become more expensive the older they get. This would be consistent with vintage vehicles that tend to gain value as they get older – typically after 20-25 years, provided they are in good condition and were not too common to start with.

Let’s first split the data in two with a cut-off in 1995 (21 years ago) and remove the 2016 cars from the dataset (there is only 4 months’ worth of data for that year and it seems highly inaccurate). I then plot the new data with a distinction beween vintage and modern cars.

Now I would like to take a look at the correlation values:

## [1] "Vintage cars:"
## 
##  Pearson's product-moment correlation
## 
## data:  price and yearOfRegistration
## t = -69.172, df = 28265, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3904196 -0.3704793
## sample estimates:
##        cor 
## -0.3804937
## [1] "Modern cars:"
## 
##  Pearson's product-moment correlation
## 
## data:  price and yearOfRegistration
## t = 415.11, df = 276410, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6173838 0.6219766
## sample estimates:
##       cor 
## 0.6196855

The correlation is much weaker for vintage cars than for modern cars because of the greater variance. This variance comes partly from the lower count, but also probably because vintage car prices can vary enormously accoriding to seemingly irrational factors. They no longer compete with each other so market pressure is different. Moreover their value depends on factors such as: rarity, quality of restoration, use of original parts only, authenticity (faithfulness to the exact specifications of the car when it came out of the factory), historical value, part usage, car’s history etc.

Two cars built exactly the same year, 40 years ago, and selling at the same price back then, can nowadays have orders of magnitude between their current values.

2.1.3. Price vs. mileage

In the previous section, I noticed that the variable kilometer behaves more like a categorical variable than a continuous one. However I want to try plotting both it to see which representation works best.

There is an association to log_price that looks roughly linear, except between 5,000 and 20,000km. The clearest feature is that the variance increases with mileage. Let’s compare with a boxplot, this time treating mileage as a categorical variable:

Brown diamonds represent means

Brown diamonds represent means

Despite kilometer being grouped in bins, I believe the scatter plot is a marginally better representation because the bins are not regularly spaced, so the box plot gives a distorted view.

The 5000km data looks highly suspicious - we could be observing the same effect as with yearOfRegistration == 2016. Let’s check for that in the next set of analyses:

2.1.4. Year of Registration vs. Mileage

The smoother seems to indicate that, similarly to what we observed on price, kilometers tend to be positively correlated to the age of the car up to 15-20 years old, then negatively correlated after that. This could also be linked to the fact that vintage cars don’t tend to run as much, as they are rarely the household’s main car and are often of questionable reliability. However considering the dispersion in the early years of the dataset, I am not sure I should lend much credit to this observation. Modern cars, on the other hand, generally have a higher mileage as they get older, as common sense would have it.

We also notice a group of observations at 5,000 kilometers between 1995 and 2005 approximately, that do not seem to follow the general distribution. Since that period is precisely the one with the highest density of observations in the dataset, I suspect these cars actually have a much higher mileage and that the data is wrong. These are probably the same suspicious observations that we noticed just before, for which kilometers are most likely severely under-valued.

2.1.5. Ad Uptime vs. All Other Continuous Variables

From the plots below, it doesn’t look like ad_up_timeis going to be very informative. The matrix plot also reports that there is virtually no correlation to any other continuous variable.

If anything, the very tenuous trends we observe are rather counter-intuitive: cars with a low ad_up_time seem to be generally cheaper and to have higher mileage than the rest.

2.2. Discrete variables

In this section, we are going to look at combinations of discrete variables, with a stronger focus on price as this is the variable that I would like to explain.

2.2.1. Price vs. Transmission vs. Fuel Type vs. Gearbox

Brown diamonds represent means

Brown diamonds represent means

This boxplot confirms that prices are generally higher for auto transmissions than for manuals, and also higher for Diesel cars than petrol. However there is more dispersion for auto and Diesel than for manual and petrol, probably due to their lower count in the data. We also notice that there are many outliers. Here I used a linear scale for price but cut off at €50,000, and there are many more outliers above the cut-off point that cannot be seen on the plotting area.

2.2.2. Price vs. Vehicle Type

Brown diamonds represent means

Brown diamonds represent means

As expected, different vehicle types have different price distributions, most of them approaching normal when viewed on a logarithmic scale. The highest mean and median prices are found with SUVs, followed by convertibles, coupés and people carriers (minivans in the US). The variances of these distributions are quite large, coupés in particular. However this looks like a good contributor in explaining price differences.

2.2.3. Price vs. Brand category

Since I took the time to manually classify brands by perceived “premiumness”, let’s have a look at potential associations with price:

Brown diamonds represent means

Brown diamonds represent means

In general, we observe a fairly logical pattern with higher prices for more premium brands. The only surpise is that the “budget” and “budget_plus” categories seem almost as expensive overall as the “premium_minus” brands. I am not sure why, maybe this has to do with the fact that some of the brands that make up these two categories are fairly recent and therefore have a younger population?

Brown diamonds represent means

Brown diamonds represent means

## [1] "Median year of registration by brand category:"
##        brand_cat   V1
## 1:       premium 2004
## 2: premium_minus 2004
## 3:      mid_plus 2003
## 4:   budget_plus 2007
## 5:     mid_range 2003
## 6:     mid_minus 2003
## 7:        budget 2008
## 8:         other 1999

It seems that my intuition was correct. These two categories are younger than the rest (with a 4-5 year difference in the medians) which would explain at least part of the observation we made previously.

2.2.4. Price vs. Unrepaired Damage

So cars with unrepaired damage are much cheaper on average than cars in good condition. The log10 scale is slightly deceiving here, but in reality there is a 1-to-3 to 1-to-4 difference.

However we should note that the notion of unrepaired damage is somewhat vague. No one expect 10- or 15-year-old cars to be in immaculate condition: They will always have some scratches and bumps. However these are unlikely to drop the car’s price by a factor 3 so I conclude that to most sellers, unrepaired damage means serious damage, potentially preventing the vehicule from operating normally. Is this definition presented in the eBay guidelines? Or is it just an implicit understanding from sellers? I tried to navigate the eBay website for this information but was unable to find it.


3. Linear Regression

Based on the knowledge that we gained on the previous phases of the project, I would like to attempt a linear regression to predict prices based on the most useful variables. To clean up the data further, I decided to remove all observations that are in the 5,000km group and were registered between 1990 and 2010. I decided to select the following variables:

## 
## Call:
## lm(formula = log_price ~ (powerPS:fuelType + I(log(powerPS)):fuelType + 
##     gearbox + yearOfRegistration:collector_status + kilometer:collector_status + 
##     vehicleType + brand_cat):notRepairedDamage, data = cars_no_nas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.33514 -0.10928  0.01774  0.12433  2.98695 
## 
## Coefficients: (1 not defined because of singularities)
##                                                                   Estimate
## (Intercept)                                                     -7.030e+01
## gearboxautomatic:notRepairedDamageyes                           -2.374e+01
## gearboxmanual:notRepairedDamageyes                              -2.374e+01
## gearboxautomatic:notRepairedDamageno                             7.917e-03
## gearboxmanual:notRepairedDamageno                                       NA
## vehicleTypepeople carrier:notRepairedDamageyes                   6.816e-02
## vehicleTypeconvertible:notRepairedDamageyes                      1.853e-01
## vehicleTypecoupe:notRepairedDamageyes                            6.781e-02
## vehicleTypesmall car:notRepairedDamageyes                       -4.818e-02
## vehicleTypeestate:notRepairedDamageyes                          -8.592e-02
## vehicleTypesedan:notRepairedDamageyes                           -1.424e-02
## vehicleTypeSUV:notRepairedDamageyes                              1.738e-01
## vehicleTypepeople carrier:notRepairedDamageno                    6.198e-02
## vehicleTypeconvertible:notRepairedDamageno                       1.519e-01
## vehicleTypecoupe:notRepairedDamageno                             5.488e-02
## vehicleTypesmall car:notRepairedDamageno                        -2.025e-02
## vehicleTypeestate:notRepairedDamageno                           -6.427e-02
## vehicleTypesedan:notRepairedDamageno                            -4.017e-02
## vehicleTypeSUV:notRepairedDamageno                               9.883e-02
## brand_cat.L:notRepairedDamageyes                                 1.036e-01
## brand_cat.Q:notRepairedDamageyes                                -5.565e-02
## brand_cat.C:notRepairedDamageyes                                -1.097e-01
## brand_cat^4:notRepairedDamageyes                                -7.563e-02
## brand_cat^5:notRepairedDamageyes                                 7.784e-03
## brand_cat^6:notRepairedDamageyes                                -8.997e-02
## brand_cat^7:notRepairedDamageyes                                -9.955e-02
## brand_cat.L:notRepairedDamageno                                  1.093e-01
## brand_cat.Q:notRepairedDamageno                                 -5.271e-02
## brand_cat.C:notRepairedDamageno                                 -9.334e-02
## brand_cat^4:notRepairedDamageno                                 -8.480e-02
## brand_cat^5:notRepairedDamageno                                  3.428e-02
## brand_cat^6:notRepairedDamageno                                 -4.776e-02
## brand_cat^7:notRepairedDamageno                                 -6.072e-02
## powerPS:fuelTypeother:notRepairedDamageyes                      -9.668e-04
## powerPS:fuelTypepetrol:notRepairedDamageyes                      1.553e-03
## powerPS:fuelTypecng:notRepairedDamageyes                        -1.649e-03
## powerPS:fuelTypediesel:notRepairedDamageyes                      1.296e-03
## powerPS:fuelTypeelectric:notRepairedDamageyes                    3.791e-01
## powerPS:fuelTypehybrid:notRepairedDamageyes                      4.343e-04
## powerPS:fuelTypelpg:notRepairedDamageyes                         1.455e-03
## powerPS:fuelTypeother:notRepairedDamageno                        4.973e-04
## powerPS:fuelTypepetrol:notRepairedDamageno                       1.192e-03
## powerPS:fuelTypecng:notRepairedDamageno                          1.318e-03
## powerPS:fuelTypediesel:notRepairedDamageno                       7.701e-04
## powerPS:fuelTypeelectric:notRepairedDamageno                    -4.869e-03
## powerPS:fuelTypehybrid:notRepairedDamageno                      -3.127e-04
## powerPS:fuelTypelpg:notRepairedDamageno                          6.192e-04
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes               2.419e-01
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes              1.450e-01
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                 2.275e-01
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes              1.836e-01
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes           -5.754e+00
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes              2.108e-01
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                 1.530e-01
## fuelTypeother:I(log(powerPS)):notRepairedDamageno                2.691e-01
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno               2.499e-01
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                  2.608e-01
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno               2.874e-01
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno             4.177e-01
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno               3.355e-01
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                  2.706e-01
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes   4.826e-02
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes  4.868e-02
## yearOfRegistration:collector_statusmodern:notRepairedDamageno    3.635e-02
## yearOfRegistration:collector_statusvintage:notRepairedDamageno   3.687e-02
## collector_statusmodern:kilometer:notRepairedDamageyes           -2.584e-06
## collector_statusvintage:kilometer:notRepairedDamageyes          -6.944e-06
## collector_statusmodern:kilometer:notRepairedDamageno            -2.856e-06
## collector_statusvintage:kilometer:notRepairedDamageno           -8.370e-06
##                                                                 Std. Error
## (Intercept)                                                      2.874e-01
## gearboxautomatic:notRepairedDamageyes                            9.520e-01
## gearboxmanual:notRepairedDamageyes                               9.522e-01
## gearboxautomatic:notRepairedDamageno                             1.399e-03
## gearboxmanual:notRepairedDamageno                                       NA
## vehicleTypepeople carrier:notRepairedDamageyes                   1.564e-02
## vehicleTypeconvertible:notRepairedDamageyes                      1.696e-02
## vehicleTypecoupe:notRepairedDamageyes                            1.661e-02
## vehicleTypesmall car:notRepairedDamageyes                        1.535e-02
## vehicleTypeestate:notRepairedDamageyes                           1.522e-02
## vehicleTypesedan:notRepairedDamageyes                            1.515e-02
## vehicleTypeSUV:notRepairedDamageyes                              1.767e-02
## vehicleTypepeople carrier:notRepairedDamageno                    7.317e-03
## vehicleTypeconvertible:notRepairedDamageno                       7.394e-03
## vehicleTypecoupe:notRepairedDamageno                             7.490e-03
## vehicleTypesmall car:notRepairedDamageno                         7.280e-03
## vehicleTypeestate:notRepairedDamageno                            7.231e-03
## vehicleTypesedan:notRepairedDamageno                             7.203e-03
## vehicleTypeSUV:notRepairedDamageno                               7.545e-03
## brand_cat.L:notRepairedDamageyes                                 2.728e-02
## brand_cat.Q:notRepairedDamageyes                                 2.728e-02
## brand_cat.C:notRepairedDamageyes                                 2.229e-02
## brand_cat^4:notRepairedDamageyes                                 1.526e-02
## brand_cat^5:notRepairedDamageyes                                 9.606e-03
## brand_cat^6:notRepairedDamageyes                                 6.831e-03
## brand_cat^7:notRepairedDamageyes                                 4.973e-03
## brand_cat.L:notRepairedDamageno                                  9.442e-03
## brand_cat.Q:notRepairedDamageno                                  9.482e-03
## brand_cat.C:notRepairedDamageno                                  7.692e-03
## brand_cat^4:notRepairedDamageno                                  5.222e-03
## brand_cat^5:notRepairedDamageno                                  3.205e-03
## brand_cat^6:notRepairedDamageno                                  2.228e-03
## brand_cat^7:notRepairedDamageno                                  1.654e-03
## powerPS:fuelTypeother:notRepairedDamageyes                       1.301e-03
## powerPS:fuelTypepetrol:notRepairedDamageyes                      1.061e-04
## powerPS:fuelTypecng:notRepairedDamageyes                         1.909e-03
## powerPS:fuelTypediesel:notRepairedDamageyes                      1.352e-04
## powerPS:fuelTypeelectric:notRepairedDamageyes                    2.203e-01
## powerPS:fuelTypehybrid:notRepairedDamageyes                      1.321e-03
## powerPS:fuelTypelpg:notRepairedDamageyes                         2.301e-04
## powerPS:fuelTypeother:notRepairedDamageno                        9.492e-04
## powerPS:fuelTypepetrol:notRepairedDamageno                       2.977e-05
## powerPS:fuelTypecng:notRepairedDamageno                          5.776e-04
## powerPS:fuelTypediesel:notRepairedDamageno                       3.821e-05
## powerPS:fuelTypeelectric:notRepairedDamageno                     8.594e-04
## powerPS:fuelTypehybrid:notRepairedDamageno                       2.909e-04
## powerPS:fuelTypelpg:notRepairedDamageno                          7.709e-05
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes               5.417e-02
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes              1.462e-02
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                 4.638e-02
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes              1.488e-02
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes            3.471e+00
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes              3.945e-02
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                 1.586e-02
## fuelTypeother:I(log(powerPS)):notRepairedDamageno                2.495e-02
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno               4.656e-03
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                  1.423e-02
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno               4.737e-03
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno             2.460e-02
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno               9.731e-03
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                  5.209e-03
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes   4.540e-04
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes  4.602e-04
## yearOfRegistration:collector_statusmodern:notRepairedDamageno    1.439e-04
## yearOfRegistration:collector_statusvintage:notRepairedDamageno   1.462e-04
## collector_statusmodern:kilometer:notRepairedDamageyes            6.762e-08
## collector_statusvintage:kilometer:notRepairedDamageyes           1.661e-07
## collector_statusmodern:kilometer:notRepairedDamageno             1.733e-08
## collector_statusvintage:kilometer:notRepairedDamageno            5.900e-08
##                                                                  t value
## (Intercept)                                                     -244.600
## gearboxautomatic:notRepairedDamageyes                            -24.940
## gearboxmanual:notRepairedDamageyes                               -24.930
## gearboxautomatic:notRepairedDamageno                               5.660
## gearboxmanual:notRepairedDamageno                                     NA
## vehicleTypepeople carrier:notRepairedDamageyes                     4.357
## vehicleTypeconvertible:notRepairedDamageyes                       10.928
## vehicleTypecoupe:notRepairedDamageyes                              4.082
## vehicleTypesmall car:notRepairedDamageyes                         -3.139
## vehicleTypeestate:notRepairedDamageyes                            -5.647
## vehicleTypesedan:notRepairedDamageyes                             -0.940
## vehicleTypeSUV:notRepairedDamageyes                                9.835
## vehicleTypepeople carrier:notRepairedDamageno                      8.470
## vehicleTypeconvertible:notRepairedDamageno                        20.549
## vehicleTypecoupe:notRepairedDamageno                               7.327
## vehicleTypesmall car:notRepairedDamageno                          -2.782
## vehicleTypeestate:notRepairedDamageno                             -8.888
## vehicleTypesedan:notRepairedDamageno                              -5.577
## vehicleTypeSUV:notRepairedDamageno                                13.099
## brand_cat.L:notRepairedDamageyes                                   3.798
## brand_cat.Q:notRepairedDamageyes                                  -2.040
## brand_cat.C:notRepairedDamageyes                                  -4.923
## brand_cat^4:notRepairedDamageyes                                  -4.958
## brand_cat^5:notRepairedDamageyes                                   0.810
## brand_cat^6:notRepairedDamageyes                                 -13.170
## brand_cat^7:notRepairedDamageyes                                 -20.019
## brand_cat.L:notRepairedDamageno                                   11.577
## brand_cat.Q:notRepairedDamageno                                   -5.559
## brand_cat.C:notRepairedDamageno                                  -12.134
## brand_cat^4:notRepairedDamageno                                  -16.240
## brand_cat^5:notRepairedDamageno                                   10.695
## brand_cat^6:notRepairedDamageno                                  -21.441
## brand_cat^7:notRepairedDamageno                                  -36.707
## powerPS:fuelTypeother:notRepairedDamageyes                        -0.743
## powerPS:fuelTypepetrol:notRepairedDamageyes                       14.633
## powerPS:fuelTypecng:notRepairedDamageyes                          -0.864
## powerPS:fuelTypediesel:notRepairedDamageyes                        9.586
## powerPS:fuelTypeelectric:notRepairedDamageyes                      1.721
## powerPS:fuelTypehybrid:notRepairedDamageyes                        0.329
## powerPS:fuelTypelpg:notRepairedDamageyes                           6.326
## powerPS:fuelTypeother:notRepairedDamageno                          0.524
## powerPS:fuelTypepetrol:notRepairedDamageno                        40.024
## powerPS:fuelTypecng:notRepairedDamageno                            2.282
## powerPS:fuelTypediesel:notRepairedDamageno                        20.151
## powerPS:fuelTypeelectric:notRepairedDamageno                      -5.666
## powerPS:fuelTypehybrid:notRepairedDamageno                        -1.075
## powerPS:fuelTypelpg:notRepairedDamageno                            8.032
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes                 4.466
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes                9.918
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                   4.904
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes               12.345
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes             -1.658
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes                5.344
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                   9.646
## fuelTypeother:I(log(powerPS)):notRepairedDamageno                 10.787
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno                53.686
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                   18.333
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno                60.660
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno              16.979
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno                34.482
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                   51.942
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes   106.307
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes  105.780
## yearOfRegistration:collector_statusmodern:notRepairedDamageno    252.585
## yearOfRegistration:collector_statusvintage:notRepairedDamageno   252.110
## collector_statusmodern:kilometer:notRepairedDamageyes            -38.216
## collector_statusvintage:kilometer:notRepairedDamageyes           -41.814
## collector_statusmodern:kilometer:notRepairedDamageno            -164.767
## collector_statusvintage:kilometer:notRepairedDamageno           -141.871
##                                                                 Pr(>|t|)
## (Intercept)                                                      < 2e-16
## gearboxautomatic:notRepairedDamageyes                            < 2e-16
## gearboxmanual:notRepairedDamageyes                               < 2e-16
## gearboxautomatic:notRepairedDamageno                            1.51e-08
## gearboxmanual:notRepairedDamageno                                     NA
## vehicleTypepeople carrier:notRepairedDamageyes                  1.32e-05
## vehicleTypeconvertible:notRepairedDamageyes                      < 2e-16
## vehicleTypecoupe:notRepairedDamageyes                           4.46e-05
## vehicleTypesmall car:notRepairedDamageyes                       0.001693
## vehicleTypeestate:notRepairedDamageyes                          1.64e-08
## vehicleTypesedan:notRepairedDamageyes                           0.347168
## vehicleTypeSUV:notRepairedDamageyes                              < 2e-16
## vehicleTypepeople carrier:notRepairedDamageno                    < 2e-16
## vehicleTypeconvertible:notRepairedDamageno                       < 2e-16
## vehicleTypecoupe:notRepairedDamageno                            2.37e-13
## vehicleTypesmall car:notRepairedDamageno                        0.005407
## vehicleTypeestate:notRepairedDamageno                            < 2e-16
## vehicleTypesedan:notRepairedDamageno                            2.45e-08
## vehicleTypeSUV:notRepairedDamageno                               < 2e-16
## brand_cat.L:notRepairedDamageyes                                0.000146
## brand_cat.Q:notRepairedDamageyes                                0.041367
## brand_cat.C:notRepairedDamageyes                                8.53e-07
## brand_cat^4:notRepairedDamageyes                                7.13e-07
## brand_cat^5:notRepairedDamageyes                                0.417773
## brand_cat^6:notRepairedDamageyes                                 < 2e-16
## brand_cat^7:notRepairedDamageyes                                 < 2e-16
## brand_cat.L:notRepairedDamageno                                  < 2e-16
## brand_cat.Q:notRepairedDamageno                                 2.72e-08
## brand_cat.C:notRepairedDamageno                                  < 2e-16
## brand_cat^4:notRepairedDamageno                                  < 2e-16
## brand_cat^5:notRepairedDamageno                                  < 2e-16
## brand_cat^6:notRepairedDamageno                                  < 2e-16
## brand_cat^7:notRepairedDamageno                                  < 2e-16
## powerPS:fuelTypeother:notRepairedDamageyes                      0.457362
## powerPS:fuelTypepetrol:notRepairedDamageyes                      < 2e-16
## powerPS:fuelTypecng:notRepairedDamageyes                        0.387477
## powerPS:fuelTypediesel:notRepairedDamageyes                      < 2e-16
## powerPS:fuelTypeelectric:notRepairedDamageyes                   0.085257
## powerPS:fuelTypehybrid:notRepairedDamageyes                     0.742448
## powerPS:fuelTypelpg:notRepairedDamageyes                        2.52e-10
## powerPS:fuelTypeother:notRepairedDamageno                       0.600297
## powerPS:fuelTypepetrol:notRepairedDamageno                       < 2e-16
## powerPS:fuelTypecng:notRepairedDamageno                         0.022475
## powerPS:fuelTypediesel:notRepairedDamageno                       < 2e-16
## powerPS:fuelTypeelectric:notRepairedDamageno                    1.47e-08
## powerPS:fuelTypehybrid:notRepairedDamageno                      0.282448
## powerPS:fuelTypelpg:notRepairedDamageno                         9.60e-16
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes              7.98e-06
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes              < 2e-16
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                9.41e-07
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes              < 2e-16
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes           0.097307
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes             9.09e-08
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                 < 2e-16
## fuelTypeother:I(log(powerPS)):notRepairedDamageno                < 2e-16
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno               < 2e-16
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                  < 2e-16
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno               < 2e-16
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno             < 2e-16
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno               < 2e-16
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                  < 2e-16
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes   < 2e-16
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes  < 2e-16
## yearOfRegistration:collector_statusmodern:notRepairedDamageno    < 2e-16
## yearOfRegistration:collector_statusvintage:notRepairedDamageno   < 2e-16
## collector_statusmodern:kilometer:notRepairedDamageyes            < 2e-16
## collector_statusvintage:kilometer:notRepairedDamageyes           < 2e-16
## collector_statusmodern:kilometer:notRepairedDamageno             < 2e-16
## collector_statusvintage:kilometer:notRepairedDamageno            < 2e-16
##                                                                    
## (Intercept)                                                     ***
## gearboxautomatic:notRepairedDamageyes                           ***
## gearboxmanual:notRepairedDamageyes                              ***
## gearboxautomatic:notRepairedDamageno                            ***
## gearboxmanual:notRepairedDamageno                                  
## vehicleTypepeople carrier:notRepairedDamageyes                  ***
## vehicleTypeconvertible:notRepairedDamageyes                     ***
## vehicleTypecoupe:notRepairedDamageyes                           ***
## vehicleTypesmall car:notRepairedDamageyes                       ** 
## vehicleTypeestate:notRepairedDamageyes                          ***
## vehicleTypesedan:notRepairedDamageyes                              
## vehicleTypeSUV:notRepairedDamageyes                             ***
## vehicleTypepeople carrier:notRepairedDamageno                   ***
## vehicleTypeconvertible:notRepairedDamageno                      ***
## vehicleTypecoupe:notRepairedDamageno                            ***
## vehicleTypesmall car:notRepairedDamageno                        ** 
## vehicleTypeestate:notRepairedDamageno                           ***
## vehicleTypesedan:notRepairedDamageno                            ***
## vehicleTypeSUV:notRepairedDamageno                              ***
## brand_cat.L:notRepairedDamageyes                                ***
## brand_cat.Q:notRepairedDamageyes                                *  
## brand_cat.C:notRepairedDamageyes                                ***
## brand_cat^4:notRepairedDamageyes                                ***
## brand_cat^5:notRepairedDamageyes                                   
## brand_cat^6:notRepairedDamageyes                                ***
## brand_cat^7:notRepairedDamageyes                                ***
## brand_cat.L:notRepairedDamageno                                 ***
## brand_cat.Q:notRepairedDamageno                                 ***
## brand_cat.C:notRepairedDamageno                                 ***
## brand_cat^4:notRepairedDamageno                                 ***
## brand_cat^5:notRepairedDamageno                                 ***
## brand_cat^6:notRepairedDamageno                                 ***
## brand_cat^7:notRepairedDamageno                                 ***
## powerPS:fuelTypeother:notRepairedDamageyes                         
## powerPS:fuelTypepetrol:notRepairedDamageyes                     ***
## powerPS:fuelTypecng:notRepairedDamageyes                           
## powerPS:fuelTypediesel:notRepairedDamageyes                     ***
## powerPS:fuelTypeelectric:notRepairedDamageyes                   .  
## powerPS:fuelTypehybrid:notRepairedDamageyes                        
## powerPS:fuelTypelpg:notRepairedDamageyes                        ***
## powerPS:fuelTypeother:notRepairedDamageno                          
## powerPS:fuelTypepetrol:notRepairedDamageno                      ***
## powerPS:fuelTypecng:notRepairedDamageno                         *  
## powerPS:fuelTypediesel:notRepairedDamageno                      ***
## powerPS:fuelTypeelectric:notRepairedDamageno                    ***
## powerPS:fuelTypehybrid:notRepairedDamageno                         
## powerPS:fuelTypelpg:notRepairedDamageno                         ***
## fuelTypeother:I(log(powerPS)):notRepairedDamageyes              ***
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageyes             ***
## fuelTypecng:I(log(powerPS)):notRepairedDamageyes                ***
## fuelTypediesel:I(log(powerPS)):notRepairedDamageyes             ***
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageyes           .  
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageyes             ***
## fuelTypelpg:I(log(powerPS)):notRepairedDamageyes                ***
## fuelTypeother:I(log(powerPS)):notRepairedDamageno               ***
## fuelTypepetrol:I(log(powerPS)):notRepairedDamageno              ***
## fuelTypecng:I(log(powerPS)):notRepairedDamageno                 ***
## fuelTypediesel:I(log(powerPS)):notRepairedDamageno              ***
## fuelTypeelectric:I(log(powerPS)):notRepairedDamageno            ***
## fuelTypehybrid:I(log(powerPS)):notRepairedDamageno              ***
## fuelTypelpg:I(log(powerPS)):notRepairedDamageno                 ***
## yearOfRegistration:collector_statusmodern:notRepairedDamageyes  ***
## yearOfRegistration:collector_statusvintage:notRepairedDamageyes ***
## yearOfRegistration:collector_statusmodern:notRepairedDamageno   ***
## yearOfRegistration:collector_statusvintage:notRepairedDamageno  ***
## collector_statusmodern:kilometer:notRepairedDamageyes           ***
## collector_statusvintage:kilometer:notRepairedDamageyes          ***
## collector_statusmodern:kilometer:notRepairedDamageno            ***
## collector_statusvintage:kilometer:notRepairedDamageno           ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2371 on 240703 degrees of freedom
## Multiple R-squared:  0.7649, Adjusted R-squared:  0.7649 
## F-statistic: 1.169e+04 on 67 and 240703 DF,  p-value: < 2.2e-16

So this regression achieves \(R^2=0.765\), which seems decent. There are many unsignificant parameters though, most of them for low-count fuel types such as electric or natural gas. To view these results more explicitly, let’s predict price on a few observations selected at random:

##     pred_price price    vehicleType yearOfRegistration   gearbox powerPS
##  1:  5757.3994  6900         estate               2003 automatic     177
##  2:  3462.3774  4200         estate               2003    manual     143
##  3:  5393.7412  6900          sedan               2008    manual     143
##  4:   873.6259  1399         estate               2001    manual      90
##  5: 12265.8002 14500         estate               2013    manual     140
##  6:  2338.2414  3000 people carrier               1996    manual     102
##  7:  6668.3639  5100         estate               2006 automatic     185
##  8:  4891.4393  4950         estate               2008    manual     105
##  9:  2013.9319  2850      small car               2002    manual      90
## 10:   754.9463   590      small car               1994    manual      45
## 11:  4163.1543  5620      small car               2010    manual      69
## 12:  3223.3984  4200          sedan               2002    manual     150
## 13:  4730.7793  5299          sedan               2006 automatic     140
## 14: 10812.7010  7999      small car               2014 automatic      71
## 15:  3366.7457  1600          coupe               1993    manual     150
## 16:   677.4256   990      small car               1997    manual      54
## 17:   899.4355   400          sedan               1996    manual      60
## 18:  4230.1399  2900 people carrier               2005    manual     101
## 19:  7015.7365 10399         estate               2008 automatic     140
## 20: 33452.0384 27999    convertible               2013 automatic     184
##           model kilometer monthOfRegistration fuelType         brand
##  1:    e_klasse    150000                   8   diesel mercedes_benz
##  2:    c_klasse    150000                   2   petrol mercedes_benz
##  3:      lancer    100000                  10      lpg    mitsubishi
##  4:       focus    150000                   5   diesel          ford
##  5:       astra     20000                  11   petrol          opel
##  6: transporter    150000                   7   diesel    volkswagen
##  7:         v70    150000                  10   diesel         volvo
##  8:      passat    150000                   1   diesel    volkswagen
##  9:         one    150000                  12   petrol          mini
## 10:     1_reihe    150000                   7   petrol       peugeot
## 11:         500     70000                   1   petrol          fiat
## 12:      passat    150000                   3   petrol    volkswagen
## 13:     octavia    150000                   2   diesel         skoda
## 14:      fortwo     30000                  10   petrol         smart
## 15:         3er    150000                   4   petrol           bmw
## 16:       micra    150000                   9   petrol        nissan
## 17:        golf    150000                   1   petrol    volkswagen
## 18:      meriva    125000                   6   diesel          opel
## 19:          a6    150000                  10   diesel          audi
## 20:     z_reihe     50000                   4   petrol           bmw
##     notRepairedDamage ad_up_time km_cat     brand_cat log_price
##  1:                no 1015.75347 150000       premium  3.838849
##  2:                no  582.92222 150000       premium  3.623249
##  3:                no  249.64861 100000     mid_minus  3.838849
##  4:               yes  815.22153 150000     mid_minus  3.145818
##  5:                no  119.52153  20000     mid_minus  4.161368
##  6:                no    0.00000 150000      mid_plus  3.477121
##  7:                no  216.12639 150000 premium_minus  3.707570
##  8:                no  105.88333 150000      mid_plus  3.694605
##  9:                no    0.00000 150000 premium_minus  3.454845
## 10:                no   96.21597 150000     mid_range  2.770852
## 11:                no   85.08819  70000     mid_minus  3.749736
## 12:                no 1477.16597 150000      mid_plus  3.623249
## 13:                no  651.24861 150000   budget_plus  3.724194
## 14:                no  239.71250  30000      mid_plus  3.903036
## 15:                no  230.05833 150000       premium  3.204120
## 16:                no  200.43542 150000     mid_range  2.995635
## 17:                no  349.80000 150000      mid_plus  2.602060
## 18:                no    0.00000 125000     mid_minus  3.462398
## 19:                no 1011.54375 150000       premium  4.016992
## 20:                no  349.46319  50000       premium  4.447143
##     collector_status
##  1:           modern
##  2:           modern
##  3:           modern
##  4:           modern
##  5:           modern
##  6:           modern
##  7:           modern
##  8:           modern
##  9:           modern
## 10:          vintage
## 11:           modern
## 12:           modern
## 13:           modern
## 14:           modern
## 15:          vintage
## 16:           modern
## 17:           modern
## 18:           modern
## 19:           modern
## 20:           modern
We see that some predictions are really close whereas others are widely off the mark. Let’s make this more visual:
The red line represents identity

The red line represents identity

The higher the price, the more the predictions seem to get wrong, which is intuitively logical seeing that we predicted log_price and not price directly. In log10 coordinates, this phenomenon disappears but the opposite appears: Errors seem larger for low prices:

The red line represents identity

The red line represents identity

We seem to generally underestimate prices a little. Let’s have a look at the residuals:

The residuals vs. fitted plot shows that there is still some pattern left in the data, especially toward the high-end of price predictions where we tend to under-estimate the outcome variable. The normal Q-Q curve shows a lot of departure from the ideal which is characteristic of heavy-tailed data such as we have here. The Scale-Location plot shows that the data is not completely homoscedastic. Finally, there does not seem to be any overly influencial observations according to the Residuals vs. Leverage plot, but we did a lot of cleaning up beforehand to get rid of many outliers. Looking at the three observations for which we are provided indices in the plot, we see that they are all alternative fuel vehicles:

##    price vehicleType yearOfRegistration   gearbox powerPS model kilometer
## 1:  1650       coupe               2000 automatic     193   clk    150000
## 2: 24850       sedan               2012 automatic     306 other     60000
## 3:  1642       coupe               2015 automatic     362 other     10000
##    monthOfRegistration fuelType         brand notRepairedDamage ad_up_time
## 1:                   7      cng mercedes_benz               yes 1553.35833
## 2:                   8   hybrid           bmw               yes   94.83611
## 3:                   3 electric           bmw                no 1496.37500
##    km_cat brand_cat log_price collector_status
## 1: 150000   premium  3.217484           modern
## 2:  60000   premium  4.395326           modern
## 3:  10000   premium  3.215373           modern

Reflection

Using a dataset made available on Kaggle and using data from the German eBay used car ads, I started by doing significant amounts of data cleaning, translations and conversions. I was then able to analyse each variable in turn and noticed that the price variable in particular had a very negatively skewed distribution and that a log transform was required. This initial phase of the analysis also allowed me to discover more issues with the data. Year of registration, price and mileage all had non-sensical entries, most probably due to human errors and maybe cheekiness. Most variables have large amounts of missing values. The brand and model variables tell us a lot about the German market, one of the most high-end market in Europe. Finally, I established that the ad up time was probably not usable as a predictor.

I was then able to look at associations between variables. I established that among the continuous variables, power had highest correlation to price, especially in log form, but with different profiles depending on fuel and transmission types. Looking at mileage was interesting because it exhibited two distinct correlations to price, one for “modern” cars and one for “vintage”. Mileage, another strong contributor, is unfortunately not really a continuous variable in this dataset, as it can only take a few distinct values.

Most categorical variables are also useful to the model. Fuel, gearbox and vehicle type all have associations with price. Unrepaired damages are associated to such a price drop that they litterally create a parallel market, which we must make sure that our linear model is able to capture as distinct from the bulk of the transactions.

Finally, the brand category variable that I manually created turned out to be quite useful as well as it exhibits a significant association to price.

Using the knowledge I had gained from this exploratory analysis, I built a linear model that explains 76.5% of the variance in the data. While this is not a very high level of accuracy, it is a very crude model that cannot account for all associations between variables. For instance, the price of vintage cars can vary differently depending on the brand of the vehicle. Since brand contains over 40 levels, I deliberately aggregated it into categories that try to reflect current market perception, not that of 25 or 30 years ago, so it would not necessarily be adequate as a predictor for vintage car prices. I also left model out of my set of predictors because it has over 200 levels, which is too complex for what I was trying to achieve here. Other potential predictors such as post code and free text description were also left out; while they would require a lot more work to provide sensible information, they could add significantly to the model.

Among other limitations, there is the high level of human error that this dataset seems to contain. To understand the source of some of those errors or to identify some new ones, it would be useful to be able to access the German eBay website as it was when data collection was performed. Similarly, I would like to better understand the criteria and process that were applied when scraping the website.

Finally, it would be interesting to compare this dataset to used car data from other sources to see if the model developped here would give us the same level of accuracy, or if eBay market prices are specific.

As a closing observation, I would point out that we often hear that domain-knowledge is critical to a good data analyst. I don’t know if it is a general rule, but in the present project, I found that an in-depth knowledge of the automotive market definitely helped me make more meaningful analyses, develop relevant theories and engineer useful variables.